Preparation Script for Training on Mozilla Commonvoice #111

Open · wants to merge 15 commits into base: main

Conversation

@RuntimeRacer (Contributor) commented May 1, 2023

This PR provides an end-to-end preparation script for Mozilla CommonVoice.

I built it by copying over the scripts from AIShell and combining them with the CommonVoice preparation scripts found in Icefall, which also uses Lhotse. References:

Some additional info and stats:

  • The data for the 24 languages included in the script (even more are available in the full CV corpus) is 432 GB; downloading and extracting the archives took about 12 h on my 200 Mbps connection, using a RAID 0 drive consisting of 2x PCIe x4 M.2 SSDs.
  • Preparation + tokenization took about another day.
  • I had to cut down the train/dev datasets of all the downloaded languages to 400 samples each from their dev and train subsets, because otherwise they would have become too big and stalled in a loop at the validation step. Even with the resulting 9,600 cuts per dev/train set, it takes about 30 seconds to calculate the validation loss. In case you want to train on a smaller subset of languages, you may want to increase that number or use the complete train/dev set of those language(s).
  • I was able to run training fine with up to 5 GPUs; however, there still seems to be a bug in the validation calculation (training problem #86), which requires me to use only 1 card as of now, and I hit an OOM error (CUDA OOM error when "saving batch" #110) after ~164k steps, probably due to max-duration 80 being too high for this dataset (running on an RTX 3090 24 GB).

Since I have not finished training yet, I cannot provide any sample models, results, or stats at this point.

@eschmidbauer

Any way you can limit this to English only? I tried this branch and it filled up my disk.

@RuntimeRacer (Contributor, Author)

> Any way you can limit this to English only? I tried this branch and it filled up my disk.

For English only, just limit the language list here to contain only "en": https://github.com/lifeiteng/vall-e/pull/111/files#diff-9c086567a8bee92cd4ae661ae5d75be66ae5340f8982cb287a09a78aee2041bdR22
You might want to comment out these lines as well: https://github.com/lifeiteng/vall-e/pull/111/files#diff-9c086567a8bee92cd4ae661ae5d75be66ae5340f8982cb287a09a78aee2041bdR116-R125
And remove the _subset from the test / dev datasets too, so you have a bigger test and validation dataset: https://github.com/lifeiteng/vall-e/pull/111/files#diff-9c086567a8bee92cd4ae661ae5d75be66ae5340f8982cb287a09a78aee2041bdR134-R135
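
Put together, the English-only changes would look roughly like the sketch below. This is only an illustration: the "languages" array name is an assumption (check what the prepare script in this PR actually uses), while the cutsDevList/cutsTestList variables, audio_feats_dir, and the *_subset manifest names come from the script itself.

# Sketch of the English-only setup; relies on audio_feats_dir being set
# earlier in the prepare script, as it is in this PR.
languages=("en")   # instead of the full 24-language list

# Later in the script, drop the "_subset" suffix so the full dev/test
# manifests are used instead of the 400-sample subsets:
for lang in "${languages[@]}"; do
  cutsDevList+="${audio_feats_dir}/commonvoice_cuts_${lang}_dev.jsonl.gz "
  cutsTestList+="${audio_feats_dir}/commonvoice_cuts_${lang}_test.jsonl.gz "
done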

@eschmidbauer

Thanks! I will give it a try

@@ -0,0 +1 @@
甚至 出现 交易 几乎 停滞 的 情况
@lifeiteng (Owner):
update prompts

@lifeiteng (Owner)

In order to get reasonable results, we need to design the multi-language symbol set and work with a language ID.
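
One possible way to realize the language-ID idea, purely as an illustration (none of this is in the PR; the per-language token files and their one-symbol-per-line format are assumptions): prefix every text token with its language code so symbols from different languages cannot collide in the shared symbol table.

# Illustrative sketch only: merge hypothetical per-language token lists
# (tokens_en.txt, tokens_de.txt, ..., one symbol per line) into a single
# language-tagged token list.
for lang in en de fr; do
  sed "s/^/${lang}_/" "tokens_${lang}.txt"
done | sort -u > multilang_tokens.txt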

cutsDevList+="${audio_feats_dir}/commonvoice_cuts_${lang}_dev_subset.jsonl.gz "
cutsTestList+="${audio_feats_dir}/commonvoice_cuts_${lang}_test_subset.jsonl.gz "
done
# echo "${cutsTrainList}" # debug
@lifeiteng (Owner):
clean comments

@lifeiteng (Owner)

@RuntimeRacer please update prompts.

Can you share the results here?

@RuntimeRacer (Contributor, Author)

> @RuntimeRacer please update prompts.
>
> Can you share the results here?

@lifeiteng I just started training the NAR model today; I will share results in a bit once the first few epochs have been completed.

Will also update comments then.

@pawel-polyai

@RuntimeRacer - any updates on the training performance?

@RuntimeRacer (Contributor, Author)

@pawel-polyai It's currently training NAR epoch 4 on 6x RTX 3090, after training 10 epochs for AR. Intermediate results are mediocre so far; it is able to synthesize speech (tested only English and German), but it is not yet able to fully maintain speaker identity or accent.

For example, after NAR epoch 1 it spoke with a seemingly Slavic accent; after epochs 2 and 3 it changed to French for some reason. So I'm not sure yet how precise it can get, nor whether the accent it speaks with comes from an attempted transfer of speaker identity, randomly from the last training data it saw, or from dataset bias.

However, the loss is still decreasing for NAR and I'll keep you updated. Sharing training graphs here as well:
[image: training graphs]

Also sharing my (very not-in-depth) examples; only tested with one speaker that I found TTS models have had a hard time replicating in the past, and also the intermediate models of NAR epochs 1-3: https://drive.google.com/drive/folders/1-bCwvXdXd4O2NOBigoXVdArAnZoigvWc?usp=sharing

If you want to play around with it yourself, you can perform inference with these commands:
python bin/infer.py --output-dir ./exp/valle/results --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "You're in the presence of Suzette!" --audio-prompts ./prompts/Suzette_crop1.wav --text "I am the magnificent dark princess of the netherworld." --checkpoint exp/valle/epoch-3.pt

python bin/infer.py --output-dir ./exp/valle/results --model-name valle --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --text-prompts "You're in the presence of Suzette!" --audio-prompts ./prompts/Suzette_crop1.wav --text "Ich bin die großartige Prinzessin der Ätherwelt." --checkpoint exp/valle/epoch-3.pt

@chenjiasheng (Collaborator)

Thank you for your detailed and valuable share.
The sharp swings of your train and valid loss/acc curves seem abnormal.
Generally, after 400k steps of training, even for large datasets, the NAR ACC will be more than 68%.
So there may be some issues in your data pipeline?
@lifeiteng What do you think?

@RuntimeRacer (Contributor, Author) commented May 19, 2023

> Thank you for your detailed and valuable share. The sharp swings of your train and valid loss/acc curves seem abnormal. Generally, after 400k steps of training, even for large datasets, the NAR ACC will be more than 68%. So there may be some issues in your data pipeline? @lifeiteng What do you think?

I'm training this model on 20 different languages, so I believe it has issues handling some of them, or the dialects of a certain subset of the data. I believe it is still improving in accuracy, though.
If it still turns out to be bad / inaccurate after 10 epochs, I will stop the multi-language experiment and continue with LibriLight English only instead.

@lifeiteng (Owner)

VALL-E (this repo) focuses on a single language; in order to support multi-language, we should design some experiments and verify them.

If this PR can get reasonable results, I'm OK to approve it.
Anyway, thanks to @RuntimeRacer for the contribution.

@chenjiasheng we can synthesize some audio to judge the effectiveness of the model & data pipeline.
66% seems fine, the model hasn't converged yet.

@RuntimeRacer (Contributor, Author)

> VALL-E (this repo) focuses on a single language; in order to support multi-language, we should design some experiments and verify them.
>
> If this PR can get reasonable results, I'm OK to approve it. Anyway, thanks to @RuntimeRacer for the contribution.
>
> @chenjiasheng we can synthesize some audio to judge the effectiveness of the model & data pipeline. 66% seems fine, the model hasn't converged yet.

Yes, it is still in the process of converging; I believe even the 10-epoch AR model was far from fully converged, so it might be worth the effort to do another follow-up training of the stage 1 / AR model.

@lifeiteng Do you think I can just continue AR training independently later, even though the NAR model has already been trained, and after ~10 more AR epochs (which would be 20 in total) do another 10 epochs on NAR for fine-tuning?

So the current iteration would be 10 AR / 10 NAR,

and later 20 AR / 20 NAR, for example.

@lifeiteng (Owner)

@RuntimeRacer NAR needs more epochs than AR. You can switch to training AR.
Now you can try synthesizing audio with infer.py --continual, which can verify whether the NAR model works.
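
For reference, such a check could look roughly like the command below. This is only a sketch that reuses the flags from the inference commands shared earlier and adds the --continual option; double-check the exact flag spelling and value format against infer.py.

# Sketch only: same flags as the earlier inference commands, plus --continual
# (value format assumed to match the other boolean flags) to let infer.py
# continue from the audio prompt and sanity-check the NAR stage.
python bin/infer.py --output-dir ./exp/valle/results --model-name valle \
  --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 \
  --num-decoder-layers 12 --continual true \
  --text-prompts "You're in the presence of Suzette!" \
  --audio-prompts ./prompts/Suzette_crop1.wav \
  --text "I am the magnificent dark princess of the netherworld." \
  --checkpoint exp/valle/epoch-3.pt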

@RuntimeRacer (Contributor, Author) commented May 20, 2023

@lifeiteng I am confused now. I tried to restart AR training from the checkpoint I had already trained NAR on, using this command:
python3 bin/trainer.py --max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 --num-buckets 6 --dtype bfloat16 --save-every-n 2500 --valid-interval 2500 --model-name valle --share-embedding true --norm-first true --add-prenet false --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 --base-lr 0.05 --warmup-steps 200 --average-period 0 --num-epochs 40 --start-epoch 11 --start-batch 0 --accumulate-grad-steps 4 --world-size 6 --keep-last-k 40 --exp-dir exp/valle --manifest-dir /workspace/kajispeech-v2/commonvoice/data/tokenized --text-tokens /workspace/kajispeech-v2/commonvoice/data/tokenized/unique_text_tokens.k2symbols --oom-check false

It also said it loaded from the existing epoch-10.pt file, which contains 5 epochs of NAR.

But now, after a few hours, I checked the graphs and outputs, and it has basically started completely from scratch:
[image: training graph]

All the checkpoints contain only AR weights, judging by their size, and training started off from step 0.
However, the graph suggests to me that it did not in fact start completely from scratch; it converges from a loss of 2.7 - but I'm not fully sure:
[image: training graph]

And the NAR weights seem to be completely erased in the new checkpoints, judging by their size.

@chenjiasheng (Collaborator) commented May 21, 2023

@RuntimeRacer
Hey, bro, please just train AR and NAR independently, using stage=1 and stage=2 respectively.
No need to rely on the tricky checkpoint transfer mechanism between AR and NAR.
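
As a rough sketch of what training the two stages independently could look like (the flags are copied from the trainer command @RuntimeRacer shared above; the separate experiment directories and the epoch counts are placeholders, not tested settings):

# Shared flags, copied from the trainer invocation quoted earlier in this thread.
COMMON_FLAGS="--max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 \
  --num-buckets 6 --dtype bfloat16 --save-every-n 2500 --valid-interval 2500 \
  --model-name valle --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
  --base-lr 0.05 --warmup-steps 200 --average-period 0 --accumulate-grad-steps 4 \
  --world-size 6 --keep-last-k 40 \
  --manifest-dir /workspace/kajispeech-v2/commonvoice/data/tokenized \
  --text-tokens /workspace/kajispeech-v2/commonvoice/data/tokenized/unique_text_tokens.k2symbols"

# Stage 1: AR model in its own experiment directory.
python3 bin/trainer.py $COMMON_FLAGS --train-stage 1 --num-epochs 20 --exp-dir exp/valle_ar

# Stage 2: NAR model, trained independently in a separate experiment directory.
python3 bin/trainer.py $COMMON_FLAGS --train-stage 2 --num-epochs 20 --exp-dir exp/valle_nar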

@lifeiteng
How about we disable train_stage 0 and add both the AR and NAR checkpoints to the args of infer.py, instead of a single merged checkpoint?
I see many people confused here, including myself two weeks ago.
Maybe @RuntimeRacer could make a PR? If not, I will.

@lifeiteng (Owner)

@RuntimeRacer @chenjiasheng Yes, we can do better!

There exists a trick in the current implementation:
First, --train-stage 1 -> best-valid.pt.
Then cp best-valid.pt to epoch-2.pt and train with --start-epoch 3 (= 2 + 1), which reloads the weights trained with --train-stage 1 and then starts training the NAR weights.
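
Spelled out as shell steps, the trick could look roughly like this (a sketch: the exp/valle paths and the abbreviated flag set are taken from the commands shared above, the --train-stage 2 flag is implied by "starts training the NAR weights", and anything omitted falls back to trainer.py defaults):

# After the stage-1 (AR) run has produced exp/valle/best-valid.pt,
# expose it as an epoch checkpoint so the next run reloads the AR weights.
cp exp/valle/best-valid.pt exp/valle/epoch-2.pt

# Resume in the same exp dir with --start-epoch 3 (= 2 + 1): trainer.py loads
# epoch-2.pt (the stage-1 AR weights) and then trains the NAR weights.
python3 bin/trainer.py --train-stage 2 --start-epoch 3 --exp-dir exp/valle \
  --model-name valle --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 \
  --manifest-dir /workspace/kajispeech-v2/commonvoice/data/tokenized \
  --text-tokens /workspace/kajispeech-v2/commonvoice/data/tokenized/unique_text_tokens.k2symbols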

A PR is welcome! I don't have time to work on it right now.

@RuntimeRacer (Contributor, Author)

Small update on AR model training progress:
[image: training graph]
It's slow, but it continues to converge. I'll keep it running until the valid loss eventually stops improving.
